Authorship Attribution and Optical Character Recognition Errors
Authors
Abstract
Stylometric authorship attribution is a fundamental problem. The basic idea behind this research is that the authorship of a document can be determined from cognitive and linguistic quirks that uniquely identify a person. In many cases, however, noise in the original documents can make this analysis more difficult and less reliable. We investigate the errors introduced by a typical optical character recognition (OCR) process. Using simulated (random) errors in a standard benchmark corpus, we test how sensitive the authorship attribution process is to character misrecognition. Our results indicate that, while accuracy decreases measurably with noise, the decrease is not substantial.
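The abstract does not specify how the random errors were generated. As a rough illustration only, the sketch below assumes the simplest possible noise model: each alphabetic character is independently replaced by a different, uniformly chosen letter with a fixed probability. The function names `add_ocr_noise` and `char_error_rate`, the `error_rate` parameter, and the default alphabet are hypothetical and are not taken from the paper.

```python
import random
import string


def add_ocr_noise(text, error_rate, alphabet=string.ascii_lowercase, seed=None):
    """Simulate character misrecognition (assumed uniform substitution model).

    Each alphabetic character is replaced, with probability `error_rate`,
    by a different letter drawn uniformly from `alphabet`. Case is kept,
    and non-alphabetic characters (spaces, punctuation, digits) are untouched.
    """
    rng = random.Random(seed)  # local generator so runs are reproducible
    out = []
    for ch in text:
        if ch.isalpha() and rng.random() < error_rate:
            # Exclude the original letter so every substitution is a real error.
            choices = [c for c in alphabet if c != ch.lower()]
            sub = rng.choice(choices)
            out.append(sub.upper() if ch.isupper() else sub)
        else:
            out.append(ch)
    return "".join(out)


def char_error_rate(clean, noisy):
    """Fraction of positions that differ between two equal-length strings."""
    if not clean:
        return 0.0
    return sum(a != b for a, b in zip(clean, noisy)) / len(clean)
```

Under these assumptions, a sensitivity test would apply `add_ocr_noise` to the test documents at several values of `error_rate` and compare attribution accuracy at each level. Real OCR errors are not uniform (visually similar characters are confused more often), so this model is a simplification.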
Similar resources
Attributing Authorship in the Noisy Digitized Correspondence of Jacob and Wilhelm Grimm
This article presents the results of a multidisciplinary project aimed at better understanding the impact of different digitization strategies in computational text analysis. More specifically, it describes an effort to automatically discern the authorship of Jacob and Wilhelm Grimm in a body of uncorrected correspondence processed by HTR (Handwritten Text Recognition) and OCR (Optical Characte...
On musical stylometry—a pattern recognition approach
In this short communication we describe some experiments in which methods of statistical pattern recognition are applied for musical style recognition and disputed musical authorship attribution. Values of a set of 20 features (also called "style markers") are measured in the scores of a set of compositions, mainly describing the different sonorities in the compositions. For a first study ove...
On the Robustness of Authorship Attribution Based on Character N-gram Features
A number of independent authorship attribution studies have demonstrated the effectiveness of character n-gram features for representing the stylistic properties of text. However, the vast majority of these studies examined the simple case where the training and test corpora are similar in terms of genre, topic, and distribution of the texts. Hence, there are doubts whether such a simple and lo...
The use of sampling techniques in the retention of records: A RAMP study with guidelines
Optical Character Recognition (OCR) document. WARNING! Spelling errors might subsist. In order to access to the original document in image form, click on "Original" button on 1st page...
Authorship Attribution using Compression Distances
Authorship attribution has been a field of interest for researchers in the past, especially for forensic purposes. In this thesis, to obtain the degree of Bachelor of Science from the Leiden University, we investigate character n-grams and so-called compression distances to prototypes on several datasets, i.e., the datasets provided by PAN Labs (a benchmarking activity on uncovering plagiarism,...
Journal: TAL
Volume: 53
Publication date: 2012